For the citations to be useful in the modelling, we need to put them into groups of topics. This is challenging because the citations are highly fragmented: we don’t have the complete text of the titles in the citations, so we can’t use typical text-analysis methods such as clustering by words or topic modelling.
How many unique citations do we have?
# If we are resuming the analysis we can just start from here, and read in the data
# frame of citations.
xx <- readRDS(paste0(here::here(), "/analysis/data/derived_data/items_wth_refs.rds"))
length(unique(xx$.out))
## [1] 432899
How can we put them into groups?
One approach we can try is to identify clusters of similar citations according to their co-occurrences in the citation lists of articles. The idea is that similar articles will cite similar items. So if we can identify similar citations by looking at their co-occurrences, then we can group them into clusters to use as taxa for the extinction-innovation modelling.
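As a toy illustration of this pairwise-counting idea, here is a base-R sketch with hypothetical citation keys (the real data is processed at scale with widyr below):

```r
# Toy illustration of pairwise co-occurrence counting (hypothetical
# citation keys; base R only).
refs <- data.frame(
  article  = c("A", "A", "A", "B", "B", "C"),
  citation = c("BINFORD 1981", "LYMAN 1994", "SHANKS 1987",
               "BINFORD 1981", "LYMAN 1994", "SHANKS 1987"),
  stringsAsFactors = FALSE
)

# enumerate every within-article pair of cited items
pairs <- do.call(rbind, lapply(split(refs$citation, refs$article), function(cites) {
  if (length(cites) < 2) return(NULL)
  t(combn(sort(cites), 2))
}))

# count how often each pair co-occurs across articles
pair_counts <- as.data.frame(table(item1 = pairs[, 1], item2 = pairs[, 2]),
                             stringsAsFactors = FALSE)
pair_counts <- pair_counts[pair_counts$Freq > 0, ]
pair_counts[order(-pair_counts$Freq), ]
```

Here "BINFORD 1981" and "LYMAN 1994" are cited together in two articles, so that pair gets a count of 2; items cited together more often are treated as more similar.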
Here’s a table of pairwise counts for all citations in our sample:
library(tidyverse)
## -- Attaching packages ----------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1.9000 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0.9000 v stringr 1.3.0
## v readr 1.2.0 v forcats 0.3.0
## -- Conflicts -------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
# count how many times each pair of citations occurs together in a journal article.
# this is very slow, and produces a data frame with 27 million rows...
library(widyr)
title_word_pairs <-
  xx %>%
  pairwise_count(.out,
                 title,
                 sort = TRUE,
                 upper = FALSE)
title_word_pairs
## # A tibble: 27,236,353 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 BOURDIEU PIERRE 1977 OUTLINE THE~ GIDDENS A 1984 CONSTITUTION SO~ 47.
## 2 SHANKS M 1987 SOCIAL THEORY ARCH~ SHANKS M 1987 RECONSTRUCTING A~ 41.
## 3 BRONK R C 2009 RADIOCARBON V51 P~ REIMER PJ 2013 RADIOCARBON V55~ 40.
## 4 LATOUR BRUNO 1993 WE HAVE NEVER ~ LATOUR B 2005 REASSEMBLING SOC~ 37.
## 5 RAMSEY CB 2009 RADIOCARBON V51 P~ REIMER PJ 2009 RADIOCARBON V51~ 37.
## 6 MCBREARTY S 2000 J HUM EVOL V39 ~ HENSHILWOOD CS 2003 CURR ANTHR~ 35.
## 7 BOURDIEU PIERRE 1977 OUTLINE THE~ GIDDENS ANTHONY 1984 CONSTITUT~ 34.
## 8 BINFORD L 1981 BONES ANCIENT MEN~ LYMAN R L 1994 VERTEBRATE TAPH~ 34.
## 9 BINFORD L 1981 BONES ANCIENT MEN~ BINFORD L 1978 NUNAMIUT ETHNOA~ 33.
## 10 INGOLD T 2000 PERCEPTION ENV ESS~ TILLEY C 1994 PHENOMENOLOGY LA~ 33.
## # ... with 27,236,343 more rows
With these pairwise counts we can draw a network plot to get a sense of how meaningful the co-occurrences are:
# network plot of citations: co-citations
library(ggplot2)
library(igraph)
##
## Attaching package: 'igraph'
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
library(ggraph)
library(tidygraph)
##
## Attaching package: 'tidygraph'
## The following object is masked from 'package:igraph':
##
## groups
## The following object is masked from 'package:stats':
##
## filter
set.seed(1234)
title_word_pairs %>%
  filter(n >= 25) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n), edge_colour = "cyan4") +
  geom_node_point(size = 5) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_void()
We can do community detection on this network to identify clusters of similar citations, according to the frequency of their co-occurrence:
title_word_pairs %>%
  filter(n >= 25) %>%
  graph_from_data_frame() %>%
  as_tbl_graph() %>%
  mutate(community = as.factor(group_edge_betweenness())) %>%
  ggraph(layout = 'kk') +
  geom_edge_link(aes(alpha = ..index..), show.legend = FALSE) +
  geom_node_point(aes(colour = community), size = 7) +
  geom_node_text(aes(label = name), repel = TRUE,
                 point.padding = unit(0.2, "lines")) +
  theme_graph()
Here we extract the community membership into a data frame alongside the citations:
citations_community <-
  title_word_pairs %>%
  filter(n >= 15) %>%
  graph_from_data_frame() %>%
  as_tbl_graph() %>%
  mutate(community = as.factor(group_edge_betweenness())) %>%
  as_tibble() %>%
  arrange(community)
# how many unique communities here?
length(unique(citations_community$community))
## [1] 38
We need columns for the community number, the earliest publication year in the community, and the most recent citation year in the community (lineage-style data). We could also give each community a broader discipline category. In this case, having only ten communities (taxa) might be too few, as we wouldn’t really be capturing the range of different communities.
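A minimal base-R sketch of the kind of per-community summary we are after, using hypothetical items, years, and community labels:

```r
# Hypothetical occurrences: one row per (cited item, citing year), with a
# community label from the network clustering.
occ <- data.frame(
  item      = c("r1", "r1", "r2", "r2", "r3"),
  year      = c(1980, 1995, 1990, 1991, 2001),
  community = c(1, 1, 1, 1, 2)
)

# earliest and most recent citation year within each community
lineage <- aggregate(year ~ community, data = occ,
                     FUN = function(y) c(first = min(y), last = max(y)))
lineage <- do.call(data.frame, lineage)  # flatten the matrix column
lineage
```

Community 1 spans 1980 to 1995; community 2 is only seen in 2001. The dplyr pipeline below does the same thing on the real data.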
# convert to occurrence format
xx_with_communities_occ_format <-
  xx %>%
  left_join(citations_community,
            by = c(".out" = "name")) %>%
  filter(year >= 1975) %>%
  mutate(time_unit = year,
         Species = community) %>%
  mutate(citing_paper = paste0(authors, " ",
                               year, " ",
                               title, " ",
                               journal)) %>%
  select(citing_paper,
         .out,
         time_unit,
         Species) %>%
  group_by(.out) %>%
  mutate(min_age = min(time_unit),
         max_age = max(time_unit),
         Status = "extinct") %>%
  distinct(.out,
           min_age,
           max_age,
           Status,
           Species)
# drop items that are only cited in a single year
xx_with_communities_occ_format_multi_year <-
  xx_with_communities_occ_format %>%
  rename(Specimen = .out) %>%
  select(Specimen,
         Status,
         min_age,
         max_age,
         Species) %>%
  mutate(span = max_age - min_age) %>%
  filter(span != 0) %>%
  select(-span)
# convert to lineage format
xx_with_communities_lineage_format_multi_year <-
  xx_with_communities_occ_format_multi_year %>%
  group_by(Specimen) %>%
  summarise(first_year = min(min_age),
            last_year = max(max_age),
            Species = Species)
head(xx_with_communities_lineage_format_multi_year)
## # A tibble: 6 x 4
## Specimen first_year last_year Species
## <chr> <dbl> <dbl> <fct>
## 1 1976 ISTORIZM ARKHEOLOGII 1982. 1993. <NA>
## 2 1979 KAOGU XUEBAO P27 1988. 1991. <NA>
## 3 1979 SHAANXI CHUTU SHANG 1988. 1991. <NA>
## 4 1995 NATURE V377 P372 1996. 1999. <NA>
## 5 "Ť\u008fÆÂ£Æ¥ XIA ZHENGJIE 2008 Ǭ¬Å\u009~ 2013. 2016. <NA>
## 6 "Å\u0088\u0098ĸ\u009cÇ\u0094\u009f LIU TU~ 2007. 2009. <NA>
We can look to see how dominant each citation community is over time to see trends in how archaeologists cite literature.
Here we see the number of citation communities in the literature over time:
explore_citation_communities <-
  xx %>%
  left_join(citations_community,
            by = c(".out" = "name")) %>%
  filter(year >= 1975) %>%
  mutate(time_unit = year,
         Species = community) %>%
  mutate(citing_paper = paste0(authors, " ",
                               year, " ",
                               title, " ",
                               journal)) %>%
  select(citing_paper,
         .out,
         time_unit,
         Species) %>%
  filter(!is.na(Species))
# each community as a proportion of all communities in that year.
communities_per_year <-
  explore_citation_communities %>%
  group_by(Species,
           time_unit) %>%
  tally() %>%
  ungroup() %>%
  group_by(time_unit) %>%
  mutate(perc = n / sum(n) * 100) %>%
  ungroup()
communities_per_year %>%
  select(time_unit, Species) %>%
  group_by(time_unit) %>%
  tally() %>%
  ggplot(aes(time_unit, n)) +
  geom_col() +
  theme_minimal() +
  ggtitle("Number of citation communities per year")
Here we compute, for each year, the percentage of each community among all of the cited works that we assign to a community. We only plot the communities that show a wide range, because many show little variation over time.
We see some interesting trends:
# just plot the ones that have a wide range, who cares about the flatliners?
# get communities whose percentage ranges more than a certain amount
communities_per_year_wide_range <-
  communities_per_year %>%
  group_by(Species) %>%
  summarise(range = max(perc) - min(perc)) %>%
  arrange(desc(range)) %>%
  filter(range >= 10)
# plot communities over time.
citation_communities_over_time <-
  communities_per_year %>%
  filter(Species %in% communities_per_year_wide_range$Species) %>%
  ggplot(aes(time_unit,
             perc,
             colour = Species)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  scale_y_log10() +
  theme_minimal() +
  ggtitle("Change over time in citation communities in archaeological literature") +
  xlab("Year") +
  ylab("Percentage of all cited items in that year")

plotly::ggplotly(citation_communities_over_time)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
We can make a histogram showing the distribution of the number of years each community was cited for (community lifespan):
xx_with_communities_occ_format_multi_year_hist <-
  xx_with_communities_occ_format_multi_year %>%
  filter(!is.na(Species)) %>%
  mutate(number_of_years_cited = max_age - min_age + 1)

# ggplot(xx_with_communities_occ_format_multi_year_hist,
#        aes(number_of_years_cited)) +
#   geom_histogram()

hist(xx_with_communities_occ_format_multi_year_hist$number_of_years_cited,
     main = "Number of years each article in a citation community was cited for",
     xlab = "Number of years cited")
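Note that the lifespan arithmetic above is inclusive of both endpoint years:

```r
# an item first cited in 2000 and last cited in 2005 was in circulation
# across six distinct citation years (both endpoints included)
min_age <- 2000
max_age <- 2005
number_of_years_cited <- max_age - min_age + 1
number_of_years_cited  # 6
```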
library(pyrater)

# can't do it with occurrence format: Error: cannot allocate vector of size 29.5 Mb
# so let's use the lineage format, where each citation only appears once
refs_lineage_format <-
  xx_with_communities_occ_format_multi_year %>%
  filter(!is.na(Species)) %>%
  group_by(Species) %>%
  summarise(first_year = min(min_age),
            last_year = max(max_age)) %>%
  mutate(taxa = "NA") %>%
  select(taxa,
         Species,
         first_year,
         last_year)
# must be c("clade", "species", "min_age", "max_age")
refs_species_pyrate_data <- lineage_pyrate(refs_lineage_format)
refs_species_pyrate_data$species_pyrate_data$clade <- 0
# We can preview each data frame:
lapply(refs_species_pyrate_data, head)
## $species_pyrate_data
## clade species ts te
## 1 0 1 30 0
## 2 0 2 37 0
## 3 0 3 35 0
## 4 0 4 36 0
## 5 0 5 42 0
## 6 0 6 21 0
##
## $fixed_clade_key
## # A tibble: 1 x 3
## Names `ID Number` Count
## <fct> <dbl> <int>
## 1 NA 1. 38
##
## $fixed_species_key
## Names ID Number
## 1 1 1
## 2 2 2
## 3 3 3
## 4 4 4
## 5 5 5
## 6 6 6
We write a text file to be used with the PyRate analysis later on. This file will appear in our current working directory:
write.table(refs_species_pyrate_data$species_pyrate_data,
            "lineage_pyrate_data.txt",
            quote = FALSE,
            sep = "\t",
            row.names = FALSE)
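As a quick sanity check of the layout this produces, here is a minimal mock of the four-column table shown in the preview above (clade, species, ts, te), with hypothetical values, written the same way:

```r
# minimal mock of the four-column, tab-separated table used for PyRate:
# clade, species, ts (origination time), te (extinction time)
pyrate_input <- data.frame(
  clade   = c(0, 0),
  species = c(1, 2),
  ts      = c(30, 37),
  te      = c(0, 0)
)
tmp <- tempfile(fileext = ".txt")
write.table(pyrate_input, tmp, quote = FALSE, sep = "\t", row.names = FALSE)
readLines(tmp)  # header line, then one row per species
```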
Prior to starting the PyRate analysis, it is good practice to get a sense of the data by drawing a series of plots. Using the function below, the following plots are created:
# The numeral indicates the x-axis increments with
# the number 5 indicating 5-year time bins
# windows() # if interactive
summary_plots(refs_species_pyrate_data$species_pyrate_data, 5)
From https://www.nature.com/articles/palcomms201619
# set the number of generations: 10,000,000 generations takes about 10 hrs,
# so we use 1,000,000 here to keep the run time down
n_generations <- 1000000

# birth-death model
system(
  paste0(
    "C:/python27/python.exe",
    " ",
    system.file("Pyrate/PyRate.py", package = "pyrater"),
    " -d ",
    " lineage_pyrate_data.txt ",
    " -A 2 ",
    " lineage ",
    " -mBDI ",
    0,
    " -n ",
    n_generations
  ),
  intern = TRUE, wait = FALSE
)
# elapsed time: sec
# immigration-death model
system(
  paste0(
    "C:/python27/python.exe",
    " ",
    system.file("Pyrate/PyRate.py", package = "pyrater"),
    " -d ",
    " lineage_pyrate_data.txt ",
    " -A 2 ",
    " lineage ",
    " -mBDI ",
    1,
    " -n ",
    n_generations
  ),
  intern = TRUE, wait = FALSE
)
# elapsed time: sec
This analysis uses PyRate's default of sampling every 1,000 generations; a full 10,000,000-generation run takes roughly 10 hours. Details about changes to PyRate default settings can be found in Silvestro et al. 2014. If access to a computing cluster is available, that is the preferred option. To reduce the run time of the analysis, simply adjust the number of generations from 10 million down to, say, 1 million.
Once the BDMCMC runs are finished, we can easily view a rate-through-time plot using the following commands:
pyrate_plot(path_to_python = "C:/python27/python.exe",
            input_file = "pyrate_mcmc_logs/lineage_pyrate_data_0_BD_marginal_rates.log")
We can redraw these plots in R:
source(paste0(here::here(), "/pyrate_mcmc_logs/lineage_pyrate_data_0_BD_marginal_rates_RTT.r"))

# adjust for the most recent date in each of the datasets
age_rev <- 2017 + age

# Origination
plot(
  age_rev,
  L_hpd_M95,
  type = 'n',
  ylim = c(0, max(L_hpd_M95)),
  xlim = c(min(age_rev), max(age_rev)),
  xlab = "Years",
  ylab = "Rate of Event Per Lineage Per Time Unit",
  main = "Origination"
)
plot_RTT(age_rev, L_hpd_M95, L_hpd_m95, L_mean, "navyblue")
# Extinction
plot(
  age_rev,
  M_hpd_M95,
  type = 'n',
  ylim = c(0, max(M_hpd_M95)),
  xlim = c(min(age_rev), max(age_rev)),
  xlab = "Years",
  ylab = "Rate of Event Per Lineage Per Time Unit",
  main = "Extinction"
)
plot_RTT(age_rev, M_hpd_M95, M_hpd_m95, M_mean, "red")
# Net Diversification
plot(
  age_rev,
  R_hpd_M95,
  type = 'n',
  ylim = c(min(R_hpd_m95), max(R_hpd_M95)),
  xlim = c(min(age_rev), max(age_rev)),
  xlab = "Years",
  ylab = "Rate of Event Per Lineage Per Time Unit",
  main = "Net Diversification"
)
plot_RTT(age_rev, R_hpd_M95, R_hpd_m95, R_mean, "darkgreen")
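A note on the `age_rev` adjustment used above: assuming, as the `2017 + age` line suggests, that the RTT log stores time as non-positive offsets from the most recent observation (2017, the latest year in our data), adding 2017 recovers calendar years. A sketch with hypothetical offsets:

```r
# hypothetical PyRate time offsets (0 = most recent year in the data);
# the sign convention is inferred from the `2017 + age` adjustment above
age <- c(-40, -20, 0)
age_rev <- 2017 + age
age_rev  # 1977 1997 2017
```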